Human genetic clustering

Human genetic clustering analysis uses mathematical cluster analysis of the degree of similarity of genetic data between individuals and groups to infer population structures and assign individuals to groups that often correspond with their self-identified geographical ancestry. A similar analysis can be done using principal components analysis, which in earlier research was a popular method.[1] Many of recent studies in the past few years have returned to using principal components analysis.

Contents

Studies

Phylogenetic tree by Cavalli-Sforza & al. (1994)

Neighbor-joining method, by Naruya Saitou and Masatoshi Nei (2002)

Clusters by Rosenberg & al. (2006)

In 2004, Lynn Jorde and Steven Wooding argued that "Analysis of many loci now yields reasonably accurate estimates of genetic similarity among individuals, rather than populations. Clustering of individuals is correlated with geographic origin or ancestry."[3]

A study by Neil Risch in 2005 used 326 microsatellite markers and self-identified race/ethnic group (SIRE), white (European American), African-American (black), Asian and Hispanic (individuals involved in the study had to choose from one of these categories), to representing discrete "populations", and showed distinct and non-overlapping clustering of the white, African-American and Asian samples. The results were claimed to confirm the integrity of self-described ancestry: "We have shown a nearly perfect correspondence between genetic cluster and SIRE for major ethnic groups living in the United States, with a discrepancy rate of only 0.14%."(Tang, 2005)

Studies such as those by Risch and Rosenberg use a computer program called STRUCTURE to find human populations (gene clusters). It is a statistical program that works by placing individuals into one of two clusters based on their overall genetic similarity, many possible pairs of clusters are tested per individual to generate multiple clusters.[4] These populations are based on multiple genetic markers that are often shared between different human populations even over large geographic ranges. The notion of a genetic cluster is that people within the cluster share on average similar allele frequencies to each other than to those in other clusters. (A. W. F. Edwards, 2003 but see also infobox "Multi Locus Allele Clusters") In a test of idealised populations, the computer programme STRUCTURE was found to consistently under-estimate the numbers of populations in the data set when high migration rates between populations and slow mutation rates (such as single-nucleotide polymorphisms) were considered.[5]

Nevertheless the Rosenberg et al. (2002) paper shows that individuals can be assigned to specific clusters to a high degree of accuracy. One of the underlying questions regarding the distribution of human genetic diversity is related to the degree to which genes are shared between the observed clusters. It has been observed repeatedly that the majority of variation observed in the global human population is found within populations. This variation is usually calculated using Sewall Wright's Fixation index (FST), which is an estimate of between to within group variation. The degree of human genetic variation is a little different depending upon the gene type studied, but in general it is common to claim that ~85% of genetic variation is found within groups, ~6–10% between groups within the same continent and ~6–10% is found between continental groups. For example The Human Genome Project states "two random individuals from any one group are almost as different [genetically] as any two random individuals from the entire world."[6] Sarich and Miele, however, have argued that estimates of genetic difference between individuals of different populations fail to take into account human diploidity.

The point is that we are diploid organisms, getting one set of chromosomes from one parent and a second from the other. To the extent that your mother and father are not especially closely related, then, those two sets of chromosomes will come close to being a random sample of the chromosomes in your population. And the sets present in some randomly chosen member of yours will also be about as different from your two sets as they are from one another. So how much of the variability will be distributed where? First is the 15 percent that is interpopulational. The other 85 percent will then split half and half (42.5 percent) between the intra- and interindividual within-population comparisons. The increase in variability in between-population comparisons is thus 15 percent against the 42.5 percent that is between-individual within-population. Thus, 15/42.5 is 32.5 percent, a much more impressive and, more important, more legitimate value than 15 percent.[7]

Additionally, Edwards (2003) claims in his essay "Lewontin's Fallacy" that: "It is not true, as Nature claimed, that 'two random individuals from any one group are almost as different as any two random individuals from the entire world'" and Risch et al. (2002) state "Two Caucasians are more similar to each other genetically than a Caucasian and an Asian." It should be noted that these statements are not the same. Risch et al. simply state that two indigenous individuals from the same geographical region are more similar to each other than either is to an indigenous individual from a different geographical region, a claim few would argue with. Jorde et al. put it like this:

The picture that begins to emerge from this and other analyses of human genetic variation is that variation tends to be geographically structured, such that most individuals from the same geographic region will be more similar to one another than to individuals from a distant region.[3]

Whereas Edwards claims that it is not true that the differences between individuals from different geographical regions represent only a small proportion of the variation within the human population (he claims that within group differences between individuals are not almost as large as between group differences). Bamshad et al. (2004) used the data from Rosenberg et al. (2002) to investigate the extent of genetic differences between individuals within continental groups relative to genetic differences between individuals between continental groups. They found that though these individuals could be classified very accurately to continental clusters, there was a significant degree of genetic overlap on the individual level, to the extent that, using 377 loci, individual Europeans were about 38% of the time more genetically similar to East Asians than to other Europeans.

Infobox

Multi-Locus Allele Clusters

The existence of allelic clines and the observation that the bulk of human variation is continuously distributed, has led some scientists to conclude that any categorization schema attempting to partition that variation meaningfully will necessarily create artificial truncations. (Kittles & Weiss 2003). It is for this reason, Reanne Frank argues, that attempts to allocate individuals into ancestry groupings based on genetic information have yielded varying results that are highly dependent on methodological design.[8] Serre and Pääbo (2004) make a similar claim:

The absence of strong continental clustering in the human gene pool is of practical importance. It has recently been claimed that “the greatest genetic structure that exists in the human population occurs at the racial level” (Risch et al. 2002). Our results show that this is not the case, and we see no reason to assume that “races” represent any units of relevance for understanding human genetic history.

In a response to Serre and Pääbo (2004), Rosenberg et al. (2005) make three relevant observations. Firstly they maintain that their clustering analysis is robust. Secondly they agree with Serre and Pääbo that membership of multiple clusters can be interpreted as evidence for clinality (isolation by distance), though they also comment that this may also be due to admixture between neighbouring groups (small island model). Thirdly they comment that evidence of clusterdness is not evidence for any concepts of "biological race".[9]

Risch et al. (2002) state that "two Caucasians are more similar to each other genetically than a Caucasian and an Asian", but Bamshad et al. (2004)[10] used the same data set as Rosenberg et al. (2002) to show that Europeans are more similar to Asians 38% of the time than they are to other Europeans when only 377 microsatellite markers are analysed.

Percentage similarity between two individuals from different clusters when 377 microsatellite markers are considered.[11]
x Africans Europeans Asians
Europeans 36.5
Asians 35.5 38.3
Indigenous Americans 26.1 33.4 35

In agreement with the observation of Bamshad et al. (2004), Witherspoon et al. (2007) have shown that many more than 326 or 377 microsatellite loci are required in order to show that individuals are always more similar to individuals in their own population group than to individuals in different population groups, even for three distinct populations.[6]

Witherspoon et al. (2007) have argued that even when individuals can be reliably assigned to specific population groups, it may still be possible for two randomly chosen individuals from different populations/clusters to be more similar to each other than to a randomly chosen member of their own cluster. They found that many thousands of genetic markers had to be used in order for the answer to the question "How often is a pair of individuals from one population genetically more dissimilar than two individuals chosen from two different populations?" to be "never". This assumed three population groups separated by large geographic ranges (European, African and East Asian). The entire world population is much more complex and studying an increasing number of groups would require an increasing number of markers for the same answer. Witherspoon et al. conclude that "caution should be used when using geographic or genetic ancestry to make inferences about individual phenotypes."

Clustering does not particularly correspond to continental divisions. Depending on the parameters given to their analytical program, Rosenberg and Pritchard were able to construct between divisions of between 4 and 20 clusters of the genomes studied, although they excluded analysis with more than 6 clusters from their published article. Probability values for various cluster configurations varied widely, with the single most likely configuration coming with 16 clusters although other 16-cluster configurations had low probabilities. Overall, "there is no clear evidence that K=6 was the best estimate" according to geneticist Deborah Bolnick (2008:76-77).[12]

A study by the The HUGO Pan-Asian SNP Consortium in 2009 using the similar principle components analysis found that East Asian and South-East Asian populations clustered together, and suggested a common origin for these populations. At the same time they observed a broad discontinuity between this cluster and South Asia, commenting "most of the Indian populations showed evidence of shared ancestry with European populations". It was noted that "genetic ancestry is strongly correlated with linguistic affiliations as well as geography".

See also

References

  1. ^ Population Structure and Eigenanalysis, Nick Patterson, Alkes L. Price, David Reich, s. PLoS Genet 2(12): e190. doi:10.1371/journal.pgen.0020190
  2. ^ Saitou. Kyushu Museum. 2002. February 2, 2007
  3. ^ a b Lynn B Jorde & Stephen P Wooding, 2004, "Genetic variation, classification and 'race'" in Nature Genetics 36, S28–S33 Genetic variation, classification and 'race'
  4. ^ "Genetic Similarities Within and Between Human Populations" (2007) by D.J. Witherspoon, S. Wooding, A.R. Rogers, E.E. Marchani, W.S. Watkins, M.A. Batzer and L.B. Jorde. Genetics. 176(1): 351–359.
  5. ^ Wapples, R., S. and Gaggiotti, O. What is a population? An empirical evaluation of some genetic methods for identifying the number of gene pools and their degree of connectivity Molecular Ecology (2006) 15: 1419–1439. doi:10.1111/j.1365-294X.2006.02890.x
  6. ^ a b Genetic Similarities Within and Between Human Populations by D. J. Witherspoon, S. Wooding, A. R. Rogers, E. E. Marchani, W. S. Watkins, M. A. Batzer, and L. B. Jorde Genetics. 2007 May; 176(1): 351–359.
  7. ^ Sarich VM, Miele F. Race: The Reality of Human Differences. Westview Press (2004). ISBN 0-8133-4086-1
  8. ^ Back with a Vengeance: the Reemergence of a Biological Conceptualization of Race in Research on Race/Ethnic Disparities in Health Reanne Frank
  9. ^ Rosenberg NA, Mahajan S, Ramachandran S, Zhao C, Pritchard JK, et al. (2005) Clines, Clusters, and the Effect of Study Design on the Inference of Human Population Structure. PLoS Genet 1(6): e70 doi:10.1371/journal.pgen.0010070
  10. ^ Bamshad, Wooding, Salisbury§ and Stephens (2004) Deconstructing the relationship between genetics and race. Nature Reviews Genetics 8:598–609. doi:10.1038/nrg1401
  11. ^ The table gives the percentage likelihood that two individuals from different clusters are genetically more similar to each other than to someone from their own population when 377 microsatellite markers are considered from Bamshad et al. (2004)doi:10.1038/nrg1401, original data from Rosenberg (2002).
  12. ^ Bolnick, Deborah A. (2008). "Individual Ancestry Inference and the Reification of Race as a Biological Phenomenon". In Koenig, Barbara A.; Richardson, Sarah S.; Lee, Sandra Soo-Jin. Revisiting race in a genomic age. Rutgers University Press. ISBN 9780813543246.